Stata freezing forever after -glm- command*

Sonnen Blume

Join Date: Aug 2018

Posts: 342
#1

12 Jan 2020, 23:29

Hi,

I am running a simple glm model with the -link(log)- option, and it makes both Stata 14 and 15 unresponsive on both Windows and Mac.

Code:

glm ff2 i.Age i.Education i.Occ i.Residency , family(binomial) link(log) eform nolog

This doesn't happen if I change the link function to logit. Does anyone have any idea how this can be solved?

Thanks in advance for your insights.
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#2

13 Jan 2020, 01:15

I suspect Stata does not freeze, but that glm has trouble finding a solution and continues to search for a long time. You can check whether that is the case by removing the nolog option. Why that might happen depends on your data. Since you haven't said anything about that, there is nothing more I can say.
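
A minimal sketch of that check, reusing the command from #1: drop -nolog- so the iteration log is displayed, and cap the number of iterations with -iterate()- so the command gives up instead of searching for a very long time. The cap of 50 is an arbitrary value chosen for illustration.

```stata
* Same model as in #1, but without -nolog- so each iteration is shown.
* iterate(50) is an arbitrary cap (an assumption, not a recommendation).
glm ff2 i.Age i.Education i.Occ i.Residency, family(binomial) link(log) eform iterate(50)
```

If the log likelihood keeps changing by tiny amounts without settling, the estimator is struggling rather than Stata hanging.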

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2174
#3

13 Jan 2020, 04:42

I’m a little surprised Stata even allows a log link with a binomial family. The mean function is incompatible with the quasi-likelihood. For many values of x and of the parameters, the exponential mean function can exceed one. Whenever that happens, the log likelihood is ill defined. How is ff2 measured? You need to change the link or the family.
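
To spell out why, here is the Bernoulli log likelihood under a log link, written as a worked equation. The notation (y_i for the binary outcome, x_i for the covariates, beta for the coefficients, mu_i for the mean) is introduced here for illustration and does not appear elsewhere in the thread.

```latex
\mu_i = \exp(x_i\beta), \qquad
\ell(\beta) = \sum_i \Bigl[\, y_i \log \mu_i + (1 - y_i)\,\log(1 - \mu_i) \Bigr]
            = \sum_i \Bigl[\, y_i\, x_i\beta + (1 - y_i)\,\log\bigl(1 - \exp(x_i\beta)\bigr) \Bigr]
```

The term log(1 - exp(x_i beta)) is only defined when x_i beta < 0. With the logit link, mu_i = 1/(1 + exp(-x_i beta)) always lies strictly between 0 and 1, so the problem cannot arise there.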
2 likes
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

13 Jan 2020, 04:53

It's hard to see how the log link could be good for binomial responses. What's the logic there?
1 like
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#5

13 Jan 2020, 04:57

Originally posted by Maarten Buis

I suspect Stata does not freeze, but that glm has trouble finding a solution and continues to search for a long time. You can check whether that is the case by removing the nolog option. Why that might happen depends on your data. Since you haven't said anything about that, there is nothing more I can say.

Thanks, Maarten. I just tried the command without -nolog- and it shows a long list of iterations; it finally ended with the error -convergence not achieved-. So perhaps there is something wrong with my equation.
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#6

13 Jan 2020, 05:02

Originally posted by Nick Cox

It's hard to see how the log link could be good for binomial responses. What's the logic there?

It can give you results in terms of risk ratios instead of odds ratios (logit link function). I am not saying that this is a particularly good reason, but it is a reason sometimes used for the log-link function when dealing with binary dependent variables. Personally I prefer odds ratios.
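
For a single binary exposure, the two measures can be compared directly without fitting any model. The sketch below uses the nlsw88 example data shipped with Stata and the variables married and south purely as an illustration; they are not the original poster's data.

```stata
* Illustration only: nlsw88 ships with Stata; married and south are 0/1.
sysuse nlsw88, clear

* -cs- reports the risk ratio for married, comparing south == 1 with
* south == 0; the -or- option adds the odds ratio to the same table.
cs married south, or
```

Whenever the two differ, the odds ratio is further from 1 than the risk ratio, and the gap grows as the outcome becomes more common.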

Sonnen Blume

Join Date: Aug 2018

Posts: 342
#7

13 Jan 2020, 05:21

Originally posted by Jeff Wooldridge

I’m a little surprised Stata even allows a log link with a binomial family. The mean function is incompatible with the quasi-likelihood. For many values of x and of the parameters, the exponential mean function can exceed one. Whenever that happens, the log likelihood is ill defined. How is ff2 measured? You need to change the link or the family.

Hi Jeff,
I'm actually not a statistician; I just run some basic analyses for my own work, so I do not know the math needed to understand which link suits which family. I tried it because my intention was to calculate risk ratios.
ff2 is a binary variable that stands for food frequency (adequate, inadequate).
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#8

13 Jan 2020, 05:25

Originally posted by Nick Cox

It's hard to see how the log link could be good for binomial responses. What's the logic there?

Hi Nick, actually I don't have a concrete reason for using the log link, except that it's the only way I was getting RRs, which are my objective in this analysis. I generally go for ORs but was trying a new flavour.

Maarten Buis

Join Date: Mar 2014

Posts: 3458
#9

13 Jan 2020, 06:03

As Jeff said, this is quite typical when you get predictions larger than 1. In the example below I estimate this model with a Poisson family rather than a binomial family, and show that it results in predictions larger than 1:

Code:

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. gen urban = smsa + c_city

. label define urban 2 "central city" ///
>                    1 "sub-urban" ///
>                    0 "rural"

. label value urban urban

.
. poisson married i.race i.urban age hours i.south , irr vce(robust) base

Iteration 0:   log pseudolikelihood = -2044.7245  
Iteration 1:   log pseudolikelihood = -2044.7245  

Poisson regression                              Number of obs     =      2,242
                                                Wald chi2(7)      =     169.40
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -2044.7245               Pseudo R2         =     0.0160

-------------------------------------------------------------------------------
              |               Robust
      married |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
         race |
       white  |          1  (base)
       black  |   .6873876   .0334256    -7.71   0.000     .6248997    .7561241
       other  |   1.016493   .1373108     0.12   0.904     .7800484    1.324607
              |
        urban |
       rural  |          1  (base)
   sub-urban  |   1.024406   .0346499     0.71   0.476     .9586958     1.09462
central city  |   .8324088   .0372839    -4.10   0.000     .7624493    .9087874
              |
          age |   .9908103   .0049088    -1.86   0.062     .9812359    1.000478
        hours |   .9911047    .001207    -7.34   0.000     .9887418    .9934732
              |
        south |
           0  |          1  (base)
           1  |    1.11576   .0357806     3.42   0.001      1.04779     1.18814
              |
        _cons |   1.380332   .2750365     1.62   0.106     .9340675    2.039805
-------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

. predict pr
(option n assumed; predicted number of events)
(4 missing values generated)

. sum pr, detail

                 Predicted number of events
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .3714456        .292085
 5%     .3999181       .3078054
10%     .4421122       .3527893       Obs               2,242
25%     .5400519        .352894       Sum of Wgt.       2,242

50%     .6612654                      Mean           .6427297
                        Largest       Std. Dev.      .1363438
75%     .7356569       1.033322
90%     .8012611       1.056671       Variance       .0185896
95%     .8550637       1.121848       Skewness      -.0713335
99%     .9529721       1.132253       Kurtosis       2.656633
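
To see how many observations are affected, one could count and list the fitted values above 1. This is a small follow-up sketch that assumes the poisson fit and the variable pr from the log above are still in memory.

```stata
* Count fitted values above 1; !missing() is needed because Stata
* treats missing values as larger than any number.
count if pr > 1 & !missing(pr)

* Inspect the covariate patterns that produce them.
list race urban age hours south pr if pr > 1 & !missing(pr)
```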

If I now try to estimate the GLM with a binomial family, it won't converge:

Code:

. glm married i.race i.urban age hours i.south, link(log) family(binomial)

Iteration 0:   log likelihood = -2210.3622  
Iteration 1:   log likelihood = -1990.8396  
Iteration 2:   log likelihood = -1986.4671  
Iteration 3:   log likelihood = -1986.4624  
Iteration 4:   log likelihood = -1986.4492  
Iteration 5:   log likelihood = -1986.4417  
Iteration 6:   log likelihood = -1986.4363  
Iteration 7:   log likelihood = -1986.4313  
Iteration 8:   log likelihood = -1986.4206  
Iteration 9:   log likelihood = -1986.4107  
Iteration 10:  log likelihood = -1986.4088  
.
.
.
Iteration 1537: log likelihood = -1978.3333  
Iteration 1538: log likelihood = -1978.3331  
Iteration 1539: log likelihood =  -1978.333  
Iteration 1540: log likelihood = -1978.3327  
Iteration 1541: log likelihood = -1978.3325  
Iteration 1542: log likelihood = -1978.3324  
Iteration 1543: log likelihood = -1978.3322  
Iteration 1544: log likelihood =  -1978.332  
Iteration 1545: log likelihood = -1978.3318  
--Break--
r(1);

So in short, you should not do this.


Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#10

13 Jan 2020, 06:22

Originally posted by Maarten Buis

As Jeff said, this is quite typical when you get predictions larger than 1. In the example below I estimate this model with a Poisson family rather than a binomial family, and show that it results in predictions larger than 1

But his predictors are

Code:

i.Age i.Education i.Occ i.Residency

Would that be likely in his case?

Or does it require fully interacted categorical predictors in order to assure that the predictions lie within the parameter space?
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#11

13 Jan 2020, 06:39

Originally posted by Joseph Coveney

Or does it require fully interacted categorical predictors in order to assure that the predictions lie within the parameter space?

A fully interacted / fully saturated model is guaranteed to remain between zero and one, as it just exactly reproduces the conditional means. Without those interactions you could get predictions larger than 1, even if you only include categorical variables.

1 like
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#12

13 Jan 2020, 06:50

Originally posted by Sonnen Blume

Does anyone have any idea how this can be solved.

Based on #11, try interacting your predictors. And then -margins- or -nlcom- afterward to get at the main effects. (Not sure how well this works for nonlinear or generalized linear models, but it's worth a try.)
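
A rough sketch of that suggestion, using the variable names from #1. Whether the fully interacted model can be fitted depends on the data: it needs observations in every combination of the predictors, ideally with both outcomes present in each, and it is an assumption here that this holds.

```stata
* Fully interacted version of the model in #1 (## adds main effects
* and all interactions).
glm ff2 i.Age##i.Education##i.Occ##i.Residency, family(binomial) link(log) eform

* Average predicted probability of ff2 at each level of Age.
margins Age

* The same margins expressed as differences from the base level of Age.
margins r.Age
```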
1 like
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#13

17 Jan 2020, 10:16

Originally posted by Maarten Buis

A fully interacted / fully saturated model is guaranteed to remain between zero and one, as it just exactly reproduces the conditional means. Without those interactions you could get predictions larger than 1, even if you only include categorical variables.

Thanks a lot, Maarten, for the interpretation with examples.

About the predicted values, I remember seeing regression plots containing predicted values below 0 and above 1 in some R tutorials. So maybe this condition can be relaxed. The funny thing is that the glm command with log link works for up to 3 IVs. (I find it unfair, but a robot is a robot after all.)
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#14

17 Jan 2020, 10:17

Originally posted by Joseph Coveney

Based on #11, try interacting your predictors. And then -margins- or -nlcom- afterward to get at the main effects. (Not sure how well this works for nonlinear or generalized linear models, but it's worth a try.)

Thanks Joseph. I'll surely give it a try.
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#15

17 Jan 2020, 18:59

Despite what I wrote, you don't need to use -nlcom-, by the way. It's a generalized linear model, and so you should be able to get at the marginal (main) effects using regular -lincom-.
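
For example, after an interacted -glm- fit, a combination of coefficients can be exponentiated with -lincom-. The levels used below (2 for Age and 2 for Education) are hypothetical, since the coding of these variables is not shown in the thread.

```stata
* Interacted model with the variable names from #1.
glm ff2 i.Age##i.Education##i.Occ##i.Residency, family(binomial) link(log)

* Hypothetical levels: assumes Age and Education each have a level coded 2.
* Log risk ratio for Age == 2 versus the base level, within Education == 2,
* with Occ and Residency at their base levels; eform exponentiates it.
lincom 2.Age + 2.Age#2.Education, eform
```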